pacman::p_load(ggplot2, ggiraph, plotly,
patchwork, DT, tidyverse,
ggrepel, ggthemes, hrbrthemes,
tidyverse,ggstatsplot,pals,readxl,
performance, parameters, see) Take-home_Ex01
1. The task
The purpose of this take home exercise is to to reveal the demographic and financial characteristics of the city of Engagement, using appropriate static and interactive statistical graphics methods. This exercise requires a user-friendly and interactive solution that helps city managers and planners to explore the complex data in an engaging way and reveal hidden patterns.
2. Data preparation
2.1 Installing the data packages
ggplot2: A data visualization package for creating graphics in R. It provides a flexible and powerful framework for creating elegant and complex graphs. Documentation: https://ggplot2.tidyverse.org/
ggiraph: An extension to ggplot2 that allows you to add interactivity to your ggplot2 graphics. It provides a set of functions that enable you to create interactive tooltips and clickable regions on your plots. Documentation: https://davidgohel.github.io/ggiraph/index.html
plotly: A data visualization package that allows you to create interactive, web-based plots in R. It provides a wide range of chart types, including scatter plots, line charts, and heatmaps. Documentation: https://plotly.com/r/
patchwork: A package for combining multiple ggplots into a single plot. It provides a flexible and powerful framework for arranging, annotating, and styling your plots. Documentation: https://patchwork.data-imaginist.com/
DT: A package for creating interactive data tables in R. It provides a wide range of features, including sorting, filtering, and pagination. Documentation: https://rstudio.github.io/DT/
tidyverse: A collection of packages for data manipulation and visualization in R. It provides a consistent set of tools for working with data, including data cleaning, transformation, and visualization. Documentation: https://www.tidyverse.org/
ggrepel: A package for avoiding overplotting of text labels in ggplot2. It provides a set of functions for positioning labels in a way that minimizes overlap and maximizes readability. Documentation: https://ggrepel.slowkow.com/
ggthemes: A package for creating visually appealing themes for ggplot2 graphics. It provides a set of pre-designed themes that can be used to customize the appearance of your plots. Documentation: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
hrbrthemes: A package for creating visually appealing themes for ggplot2 graphics. It provides a set of pre-designed themes that are designed to be aesthetically pleasing and easy to read. Documentation: https://hrbrmstr.github.io/hrbrthemes/
ggstatsplot: A package for creating publication-ready plots with statistical analysis built-in. It provides a set of functions for creating common statistical plots, including scatterplots, boxplots, and histograms, with statistical tests and summary information included. Documentation: https://indrajeetpatil.github.io/ggstatsplot/
pals: A package for creating color palettes for data visualization. It provides a set of functions for generating color palettes based on various themes and color schemes. Documentation: https://cran.r-project.org/web/packages/pals/vignettes/pals_examples.html
readxl: A package for reading Excel files in R. It provides a set of functions for importing data from Excel spreadsheets into R data frames. Documentation: https://readxl.tidyverse.org/
performance: A package for evaluating the performance of predictive models in R. It provides a set of functions for creating various types of performance metrics and visualizations. Documentation: https://easystats.github.io/performance/
parameters: A package for managing parameters and arguments in R. It provides a set of functions for defining, validating, and documenting parameters in R functions. Documentation: https://www.rdocumentation.org/packages/parameters/versions/0.21.0
see: A package for visualizing the output of R functions. It provides a set of functions for creating visual summaries of data frames, matrices, and other R objects. Documentation: https://cran.r-project.org/web/packages/see/index.html
2.2 Dataset introduction
The unzipped files have been saved into a new folder named data for better organization.
part_info <- read_csv("data/Participants.csv")Participants.csv
Contains information about the residents of City of Engagement that have agreed to participate in this study.
- participantId (integer): unique ID assigned to each participant.
- householdSize (integer): the number of people in the participant’s household
- haveKids (boolean): whether there are children living in the participant’s household.
- age (integer): participant’s age in years at the start of the study.
- educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}
- interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.
- joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.
finance <- read_csv("data/FinancialJournal.csv")FinancialJournal.csv
Contains information about financial transactions.
- participantId (integer): unique ID corresponding to the participant affected
- timestamp (datetime): the time when the check-in was logged
- category (string factor): a string describing the expense category, one of {“Education”, “Food”, “Recreation”, “RentAdjustment”, “Shelter”, “Wage”}
- amount (double): the amount of the transaction For explanation of Rent Adjustment, please refer to this link
2.3 Examining the datasets
Participant csv
Lets first examine the properties of the participants csv file.
datatable(part_info)str(part_info)spc_tbl_ [1,011 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
$ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
$ haveKids : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
$ age : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
$ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
$ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
$ joviality : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
- attr(*, "spec")=
.. cols(
.. participantId = col_double(),
.. householdSize = col_double(),
.. haveKids = col_logical(),
.. age = col_double(),
.. educationLevel = col_character(),
.. interestGroup = col_character(),
.. joviality = col_double()
.. )
- attr(*, "problems")=<externalptr>
Removed participant Id, education level and interestgroup from the following since summary are not meaningful
summary(part_info[, c("householdSize", "haveKids", "age","joviality")]) householdSize haveKids age joviality
Min. :1.000 Mode :logical Min. :18.00 Min. :0.000204
1st Qu.:1.000 FALSE:710 1st Qu.:29.00 1st Qu.:0.240074
Median :2.000 TRUE :301 Median :39.00 Median :0.477539
Mean :1.964 Mean :39.07 Mean :0.493794
3rd Qu.:3.000 3rd Qu.:50.00 3rd Qu.:0.746819
Max. :3.000 Max. :60.00 Max. :0.999234
Count of resident for each interest group
table(part_info$interestGroup)
A B C D E F G H I J
102 91 102 96 83 106 108 111 96 116
Count of resident for each education level
table(part_info$educationLevel)
Bachelors Graduate HighSchoolOrCollege Low
232 170 525 84
sum(is.na(part_info))[1] 0
duplicates1 <- duplicated(part_info)
part_info[duplicates1, ]# A tibble: 0 × 7
# ℹ 7 variables: participantId <dbl>, householdSize <dbl>, haveKids <lgl>,
# age <dbl>, educationLevel <chr>, interestGroup <chr>, joviality <dbl>
Since 0 rows are displayed. We can confirm that duplicates do not exist in this csv file.
Finance csv
Lets move on to examine the properties of the finance csv file.
Sneak peak of the first few entries in the dataset
head(finance)# A tibble: 6 × 4
participantId timestamp category amount
<dbl> <dttm> <chr> <dbl>
1 0 2022-03-01 00:00:00 Wage 2473.
2 0 2022-03-01 00:00:00 Shelter -555.
3 0 2022-03-01 00:00:00 Education -38.0
4 1 2022-03-01 00:00:00 Wage 2047.
5 1 2022-03-01 00:00:00 Shelter -555.
6 1 2022-03-01 00:00:00 Education -38.0
Please do not use datatable here or you will face a error of ‘Fatal javascript OOM in reached Heap Limit’ which indicate that R studio session has run out of memory while attempting to execute JavaScript code.
str(finance)spc_tbl_ [1,513,636 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ participantId: num [1:1513636] 0 0 0 1 1 1 2 2 2 3 ...
$ timestamp : POSIXct[1:1513636], format: "2022-03-01" "2022-03-01" ...
$ category : chr [1:1513636] "Wage" "Shelter" "Education" "Wage" ...
$ amount : num [1:1513636] 2473 -555 -38 2047 -555 ...
- attr(*, "spec")=
.. cols(
.. participantId = col_double(),
.. timestamp = col_datetime(format = ""),
.. category = col_character(),
.. amount = col_double()
.. )
- attr(*, "problems")=<externalptr>
Only ‘amount’ variable summary is displayed since summary of other variables are not useful
summary(finance[c("amount")]) amount
Min. :-1562.726
1st Qu.: -5.594
Median : -4.000
Mean : 20.047
3rd Qu.: 21.598
Max. : 4096.526
table(finance$category)
Education Food Recreation RentAdjustment Shelter
3319 790051 296013 131 11463
Wage
412659
sum(is.na(finance))[1] 0
There is ni missing values
Check for outlier in the amount variable. We first group the amount variables by the category. Then we do a box plot. From the chart we can observe that shelter has some abnormally small values to the negative end and wages has some exceptionally large values on the positive end. We may wish to take note of these in our analysis.
Show the code
# Create a box plot of amount by category
ggplotly(ggplot(finance, aes(x = category, y = amount, fill = category)) +
geom_boxplot() +
xlab("Expense Category") +
ylab("Amount") +
ggtitle("Amount Spent by Expense Category"))